Before starting, you will need:
In addition to Hadoop, Scoobi uses sbt (version 0.12.0) to simplify building and packaging a project for running on Hadoop. We also provide an sbt plugin, sbt-scoobi, which lets you create a self-contained JAR for Hadoop.
Here are the steps to get started on your own project:
$ mkdir my-app
$ cd my-app
$ mkdir -p src/main/scala
First, create a build.sbt file that has a dependency on Scoobi:
name := "MyApp"
version := "0.1"
scalaVersion := "2.9.2"
libraryDependencies += "com.nicta" %% "scoobi" % "0.5.0-cdh4"
scalacOptions ++= Seq("-Ydependent-method-types", "-deprecation")
resolvers += "Sonatype-snapshots" at "http://oss.sonatype.org/content/repositories/snapshots"
Now we can write some code, for instance in src/main/scala/myfile.scala:
package mypackage.myapp
import com.nicta.scoobi.Scoobi._
object WordCount extends ScoobiApp {
  def run() {
    val lines = fromTextFile(args(0))
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .groupByKey
                      .combine((a: Int, b: Int) => a + b)
    persist(toTextFile(counts, args(1)))
  }
}
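To see what this pipeline computes without a cluster, the same steps can be sketched on plain Scala collections. This is only an illustration and not Scoobi code: LocalWordCount and countWords are hypothetical names, and groupBy followed by a per-key reduce stands in for the roles that groupByKey and combine play above.

```scala
// Local sketch of the word-count steps on in-memory collections.
// No Scoobi or Hadoop is involved; all names here are hypothetical.
object LocalWordCount {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))       // one element per word
      .map(word => (word, 1))      // pair each word with a count of 1
      .groupBy(_._1)               // gather the pairs that share a word
      .map { case (word, pairs) => // sum the counts for each word
        (word, pairs.map(_._2).reduce(_ + _))
      }

  def main(args: Array[String]): Unit = {
    val counts = countWords(Seq("a b a", "b c"))
    // Print one "word<TAB>count" line per word, sorted by word.
    counts.toSeq.sorted.foreach { case (w, n) => println(w + "\t" + n) }
  }
}
```

The Scoobi version has the same shape, but its operations are distributed across Hadoop jobs rather than evaluated in memory.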
The Scoobi application can now be compiled and run using sbt:
$ sbt compile
$ sbt "run-main mypackage.myapp.WordCount input-files output"
Your Hadoop configuration will automatically get picked up, and all relevant JARs will be made available.
If you had any trouble following along, take a look at Word Count for a self-contained example.